Constructing Complex Plots

Combining Data Sources

Elizabeth King
Kevin Middleton

This week

  1. Combining data sources
  2. Adding elements
  3. Highlighting elements
  4. Reusable themes

“Complex plots”

  • Combining geoms
  • Data from different sources (data.frames, regressions, etc.)
  • Non-standard elements or axis labels (text with math symbols)
  • Custom colors or gradients
  • Highlighting different elements with color, labels or arrows

Example

Example

Example

Zooplankton diversity

Diversity of zooplankton (Diversity) prey in each of 5 replicate Blocks of three Treatment levels.1

# A tibble: 15 × 3
   Treatment Diversity Block
   <fct>         <dbl> <fct>
 1 Control         4.1 1    
 2 Low             2.2 1    
 3 High            1.3 1    
 4 Control         3.2 2    
 5 Low             2.4 2    
 6 High            2   2    
 7 Control         3   3    
 8 Low             1.5 3    
 9 High            1   3    
10 Control         2.3 4    
11 Low             1.3 4    
12 High            1   4    
13 Control         2.5 5    
14 Low             2.6 5    
15 High            1.6 5    

Points by groups

ggplot(ZP, aes(x = Treatment, y = Diversity)) +
  geom_point(size = 2,
             position = position_jitter(width = 0.05,
                                        seed = 6734747))

Points by groups

Boxplot

  • Median
  • 1st and 3rd quartiles (25th and 75th percentiles)
  • Whiskers calculated variously (here ±1.5 * IQR)
  • “Outliers” outside the whiskers
ggplot(ZP, aes(x = Treatment, y = Diversity)) +
  geom_boxplot()

Boxplot

Boxplot with points

  • Add points on top of the boxplot (ggplot adds geoms sequentially)
  • Turn off outlier highlighting (outlier.shape = NA)
ggplot(ZP, aes(x = Treatment, y = Diversity)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(size = 2,
             position = position_jitter(width = 0.05,
                                        seed = 6734747))

Boxplot with points

Precomputing quantities

SEM <- function(x) {
  return(sd(x) / sqrt(length(x)))
}

ZP_means <- ZP |> 
  group_by(Treatment) |> 
  summarise(mean_Diversity = mean(Diversity),
            upper_bound = mean_Diversity + 2 * SEM(Diversity),
            lower_bound = mean_Diversity - 2 * SEM(Diversity),
            .groups = "drop")

ZP_means
# A tibble: 3 × 4
  Treatment mean_Diversity upper_bound lower_bound
  <fct>              <dbl>       <dbl>       <dbl>
1 Low                 2           2.51       1.49 
2 Control             3.02        3.65       2.39 
3 High                1.38        1.76       0.998

Plots from multiple data sources

  • Start with an empty ggplot()
  • Use data argument to geom_
ggplot() +
  geom_point(data = ZP, aes(x = Treatment, y = Diversity),
             size = 2,
             position = position_jitter(width = 0.05, seed = 6734747)) +
  geom_point(data = ZP_means,
             aes(x = Treatment, y = mean_Diversity),
             size = 4,
             color = "gray50") +
  geom_errorbar(data = ZP_means,
                aes(x = Treatment, ymin = lower_bound, ymax = upper_bound),
                width = 0.1, size = 1,
                color = "gray50")

Plots from multiple data sources

Reaction norms for paired data

Antibody levels in Red-winged blackbirds before and after experimental manipulation of testosterone1

   ID Before After
1   1    105    85
2   2     50    74
3   3    136   145
4   4     90    86
5   5    122   148
6   6    132   148
7   7    131   150
8   8    119   142
9   9    145   151
10 10    130   113
11 11    116   118
12 12    110    99
13 13    138   150

Pivoting

BB_long <- BB |> 
  pivot_longer(cols = -ID,
               names_to = "Timepoint",
               values_to = "Antibody_level")

BB_long
# A tibble: 26 × 3
   ID    Timepoint Antibody_level
   <fct> <chr>              <int>
 1 1     Before               105
 2 1     After                 85
 3 2     Before                50
 4 2     After                 74
 5 3     Before               136
 6 3     After                145
 7 4     Before                90
 8 4     After                 86
 9 5     Before               122
10 5     After                148
# ℹ 16 more rows

Plotting reaction norms

  • group = ID tells geom_line() to associate lines with IDs.
ggplot(BB_long, aes(x = Timepoint, y = Antibody_level, group = ID)) +
  geom_point() +
  geom_line()

Plotting reaction norms

Ordering factors

Also many options in the forcats package (fct_ functions) when you already have a factor.

BB_long <- BB_long |> 
  mutate(Timepoint = factor(Timepoint, levels = c("Before", "After")))

ggplot(BB_long, aes(x = Timepoint, y = Antibody_level, group = ID)) +
  geom_point() +
  geom_line()

Ordering factors

Adding a mean and error bars

  • stat_summary() can add the output of summary functions.
  • mean_cl_boot give the mean and bootstrapped 95% CI. (requires Hmisc)
  • aes(group = -1) ungroups the data so that the statistics will calculate for each Timepoint
ggplot(BB_long, aes(x = Timepoint, y = Antibody_level, group = ID)) +
  geom_point() +
  geom_line() +
  stat_summary(aes(group = -1),
               fun.data = "mean_cl_boot",
               color = "magenta",
               size = 1)

Adding a mean and error bars

Horizontal, Vertical, and ab lines

ggplot() +
  geom_hline(yintercept = -5:5, color = "orangered") +
  geom_vline(xintercept = -5:5, color = "orangered") +
  geom_abline(slope = seq(1, 10, by = 0.5), intercept = 0)

Horizontal, Vertical, and ab lines

Horizontal, Vertical, and ab lines: refined

ggplot() +
  geom_hline(yintercept = -5:5, color = "orangered") +
  geom_vline(xintercept = -5:5, color = "orangered") +
  geom_abline(slope = seq(1, 10, by = 0.5), intercept = 0) +
  coord_equal()

Horizontal, Vertical, and ab lines: refined

Adding regression lines

  1. geom_smooth()
    • Can handle a range of models from lm() to GLM and GAM
    • Splits by aesthetics automatically
  2. Compute regression and use predict() or another helper function
    • You handle all the prediction
    • Necessary for more complex models (e.g., mixed/multilevel)

Bird species richness

Bird species richness in different habitat patches sampled in Jamaica.1

   Patch_ID n_Species Landscape_type       Area    log_Area
1      ag1a        24    Agriculture  3.5101825  0.54532969
2      ag1b        15    Agriculture  0.6155155 -0.21076101
3      ag1c        25    Agriculture  2.2350471  0.34928667
4      ag1d        35    Agriculture  8.2798820  0.91802415
5      ag2a        32    Agriculture  1.3736535  0.13787720
6      ag2b        40    Agriculture 59.2797938  1.77290668
7      ag2c        27    Agriculture  0.5580768 -0.25330600
8      ag2d        37    Agriculture  4.3000000  0.63346846
9      ag3a        40    Agriculture 19.6362412  1.29305836
10     ag3b        36    Agriculture  5.4911675  0.73966469
11     ag3c        27    Agriculture  1.1923333  0.07639768
12     ag3d        33    Agriculture  3.0919172  0.49022785
13     ag3e        25    Agriculture  1.9279442  0.28509446
14     ag3f        17    Agriculture  0.9881100 -0.00519472
15     ag4a        12    Agriculture  0.9714663 -0.01257228
16     ag4b        37    Agriculture  5.4848477  0.73916458
17     ag4c        33    Agriculture  4.4454332  0.64791409
18     ag4d        36    Agriculture 54.1524979  1.73361849
19     ag4e        35    Agriculture 18.7527709  1.27306545
20     ag4f        25    Agriculture  2.6943908  0.43046059
21     ag5a        37    Agriculture  9.1428019  0.96107931
22     ag5b        19    Agriculture  1.2075253  0.08189623
23     ag5c        37    Agriculture 32.5404276  1.51242326
24     ag5d        33    Agriculture  3.8893382  0.58987571
25     ag5e        31    Agriculture  7.1913356  0.85680956
26      b1a        37        Bauxite 13.3258040  1.12469342
27      b1b        23        Bauxite  1.7291887  0.23784238
28      b1c        16        Bauxite  0.5811953 -0.23567787
29      b1d        27        Bauxite  2.2335325  0.34899229
30      b1e        24        Bauxite  3.4718332  0.54055885
31      b2a        35        Bauxite  8.7560195  0.94230672
32      b2b        17        Bauxite  1.8562012  0.26862505
33      b2c        22        Bauxite  5.1314754  0.71024225
34      b2d        22        Bauxite  2.1336472  0.32912261
35      b2e        12        Bauxite  1.3722959  0.13744776
36      b3a        26        Bauxite  4.3642118  0.63990582
37      b3b        26        Bauxite  3.9374141  0.59521110
38      b3c        11        Bauxite  1.0991556  0.04105919
39      b3d        26        Bauxite  2.1061711  0.32349365
40      b3e        27        Bauxite  5.6320607  0.75066733
41      b3f        11        Bauxite  1.2987237  0.11351676
42      b4a        23        Bauxite  4.2931978  0.63278090
43      b4b        33        Bauxite 10.0659529  1.00285489
44      b4c        20        Bauxite  1.5915311  0.20181513
45      b4d        12        Bauxite  1.4673604  0.16653678
46      b4e        13        Bauxite  0.8040290 -0.09472829
47    Ref1a        36         Forest  5.5000000  0.74036269
48    Ref1b        20         Forest  1.2000000  0.07918125
49    Ref1c        25         Forest  3.5000000  0.54406804
50    Ref1d        20         Forest  1.2000000  0.07918125
51    Ref1e        32         Forest  7.5000000  0.87506126
52    Ref1g        16         Forest 15.0000000  1.17609126
53    Ref2a        34         Forest  3.5000000  0.54406804
54    Ref2b        29         Forest  2.2000000  0.34242268
55    Ref2c        40         Forest  5.5000000  0.74036269
56    Ref2d        29         Forest  2.2000000  0.34242268
57    Ref2e        22         Forest  2.2000000  0.34242268
58    Ref3a        14         Forest  1.2000000  0.07918125
59    Ref3b        29         Forest  5.5000000  0.74036269
60    Ref3c        27         Forest  3.5000000  0.54406804
61    Ref3d        30         Forest  2.2000000  0.34242268
62    Ref4a        23         Forest  1.2000000  0.07918125
63    Ref4b        30         Forest  3.5000000  0.54406804
64    Ref4c        26         Forest  2.2000000  0.34242268
65    Ref4d        26         Forest  2.2000000  0.34242268
66    Ref5a        18         Forest  1.2000000  0.07918125
67    Ref5b        31         Forest  5.5000000  0.74036269
68    Ref5c        28         Forest  5.5000000  0.74036269
69    Ref5d        23         Forest  3.5000000  0.54406804
70    Ref5e        19         Forest  1.2000000  0.07918125
71    Ref6a        18         Forest  1.2000000  0.07918125
72    Ref6b        23         Forest  2.2000000  0.34242268
73    Ref6c        27         Forest  3.5000000  0.54406804
74    Ref6d        26         Forest  3.5000000  0.54406804
75    Ref6e        33         Forest  7.5000000  0.87506126
76    Ref6f        31         Forest 25.0000000  1.39794001
77      u1a        16          Urban  2.7146097  0.43370740
78      u1b        30          Urban  4.6795641  0.67020540
79      u1c        21          Urban  1.6978078  0.22988851
80      u1d        14          Urban  1.4137406  0.15036974
81      u1e        19          Urban  6.8763832  0.83736007
82      u2a        31          Urban  4.6506448  0.66751317
83      u2b        16          Urban  1.4327452  0.15616897
84      u2c        22          Urban  3.2899767  0.51719283
85      u2d        17          Urban  1.2587029  0.09992323
86      u2e        23          Urban  4.4914799  0.65238946
87      u3a        34          Urban  6.7841003  0.83149226
88      u3b        24          Urban  5.1748479  0.71389759
89      u3c        25          Urban  3.1385548  0.49672972
90      u3d        15          Urban  1.3851313  0.14149094
91      u4a        25          Urban  1.8141023  0.25866177
92      u4b        19          Urban  1.0193984  0.00834394
93      u4c        24          Urban  2.9128820  0.46432290
94      u4d        23          Urban  2.6707253  0.42662921
95      u4e        27          Urban  8.3423559  0.92128871

geom_smooth()

Defaults to a loess smoother with confidence interval

ggplot(BSR, aes(x = log_Area, y = n_Species, color = Landscape_type)) +
  geom_point() +
  geom_smooth()

geom_smooth()

geom_smooth(formula = y ~ x, method = "lm")

“Standard” linear regression

ggplot(BSR, aes(x = log_Area, y = n_Species, color = Landscape_type)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "lm")

geom_smooth(formula = y ~ x, method = "lm")

Removing the confidence interval

se = FALSE

p1 <- ggplot(BSR, aes(x = log_Area, y = n_Species, color = Landscape_type)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE)
p1

Removing the confidence interval

Removing the grouping

One line through all the points: aes(group = -1)

ggplot(BSR, aes(x = log_Area, y = n_Species, color = Landscape_type)) +
  geom_point() +
  geom_smooth(aes(group = -1), formula = y ~ x, method = "lm", se = FALSE)

Removing the grouping

“Manual” prediction

fm <- lm(n_Species ~ log_Area + Landscape_type, data = BSR)
summary(fm)
  • Slope for log_Area common to all levels of Landscape_type
  • Separate intercepts for each Landscape_type

“Manual” prediction


Call:
lm(formula = n_Species ~ log_Area + Landscape_type, data = BSR)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.5508  -2.9911   0.2049   3.1838  10.7961 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             22.270      1.240  17.958  < 2e-16 ***
log_Area                12.271      1.243   9.869 5.35e-16 ***
Landscape_typeBauxite   -5.351      1.455  -3.677 0.000401 ***
Landscape_typeForest    -2.151      1.321  -1.629 0.106855    
Landscape_typeUrban     -5.506      1.488  -3.701 0.000369 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.84 on 90 degrees of freedom
Multiple R-squared:  0.6043,    Adjusted R-squared:  0.5867 
F-statistic: 34.36 on 4 and 90 DF,  p-value: < 2.2e-16

predict()

predict() has methods for many regression-like functions:

  • predict.lm()
  • predict.glm()
  • etc.
  • Type predict. and then Tab to see options

By default returns the predicted values for observed data.

predict(fm)

predict()

       1        2        3        4        5        6        7        8 
28.96159 19.68346 26.55591 33.53499 23.96166 44.02541 19.16138 30.04315 
       9       10       11       12       13       14       15       16 
38.13711 31.34631 23.20724 28.28542 25.76819 22.20600 22.11547 31.34017 
      17       18       19       20       21       22       23       24 
30.22042 43.54330 37.89177 27.55201 34.06333 23.27471 40.82897 29.50822 
      25       26       27       28       29       30       31       32 
32.78382 30.72001 19.83730 14.02664 21.20124 23.55199 28.48191 20.21504 
      33       34       35       36       37       38       39       40 
25.63421 20.95742 18.60534 24.77110 24.22264 17.42254 20.88834 26.13027 
      41       42       43       44       45       46       47       48 
18.31168 24.68366 29.22491 19.39520 18.96230 15.75626 29.20387 21.09039 
      49       50       51       52       53       54       55       56 
26.79510 21.09039 30.85678 34.55078 26.79510 24.32067 29.20387 24.32067 
      57       58       59       60       61       62       63       64 
24.32067 21.09039 29.20387 26.79510 24.32067 21.09039 26.79510 24.32067 
      65       66       67       68       69       70       71       72 
24.32067 21.09039 29.20387 29.20387 26.79510 21.09039 21.09039 24.32067 
      73       74       75       76       77       78       79       80 
26.79510 26.79510 30.85678 37.27313 22.08610 24.98821 19.58500 18.60921 
      81       82       83       84       85       86       87       88 
27.03939 24.95517 18.68037 23.11056 17.99017 24.76958 26.96739 25.52436 
      89       90       91       92       93       94       95 
22.85946 18.50025 19.93808 16.86638 22.46178 21.99924 28.06930 

Grid construction with expand_grid()

predict() option newdata allows passing a set of values for which to predict.

  • Generate predictors over a range of log_Area for each level of Landscape_type
  • Variable names must match the right-hand side of the formula
  • tidyr function expand_grid() (or crossing())
Preds <- expand_grid(
  log_Area = seq(-0.5, 2, length.out = 200),
  Landscape_type = levels(BSR$Landscape_type) |> factor()
)
Preds

Grid construction with expand_grid()

# A tibble: 800 × 2
   log_Area Landscape_type
      <dbl> <fct>         
 1   -0.5   Agriculture   
 2   -0.5   Bauxite       
 3   -0.5   Forest        
 4   -0.5   Urban         
 5   -0.487 Agriculture   
 6   -0.487 Bauxite       
 7   -0.487 Forest        
 8   -0.487 Urban         
 9   -0.475 Agriculture   
10   -0.475 Bauxite       
# ℹ 790 more rows

Generate predictions

  • Pass newdata to predict()
Preds <- Preds |> 
  mutate(Predicted = predict(fm, newdata = Preds))
Preds
# A tibble: 800 × 3
   log_Area Landscape_type Predicted
      <dbl> <fct>              <dbl>
 1   -0.5   Agriculture         16.1
 2   -0.5   Bauxite             10.8
 3   -0.5   Forest              14.0
 4   -0.5   Urban               10.6
 5   -0.487 Agriculture         16.3
 6   -0.487 Bauxite             10.9
 7   -0.487 Forest              14.1
 8   -0.487 Urban               10.8
 9   -0.475 Agriculture         16.4
10   -0.475 Bauxite             11.1
# ℹ 790 more rows

Plot new predictions

p2 <- ggplot() +
  geom_point(data = BSR,
             aes(x = log_Area, y = n_Species, color = Landscape_type)) +
  geom_line(data = Preds,
            aes(x = log_Area, y = Predicted, color = Landscape_type),
            size = 1)
p2

Plot new predictions

Compare

Making confidence interval bands

95% prediction interval

  • Interval expected to include 95% of all new observations
  • Wider than the CI for the mean estimate
Preds <- Preds |>
  bind_cols(predict(fm, newdata = Preds,
                    interval = "prediction", level = 0.95) |> 
              as_tibble())
Preds

Making confidence interval bands

# A tibble: 800 × 6
   log_Area Landscape_type Predicted   fit   lwr   upr
      <dbl> <fct>              <dbl> <dbl> <dbl> <dbl>
 1   -0.5   Agriculture         16.1  16.1 5.94   26.3
 2   -0.5   Bauxite             10.8  10.8 0.684  20.9
 3   -0.5   Forest              14.0  14.0 3.91   24.1
 4   -0.5   Urban               10.6  10.6 0.485  20.8
 5   -0.487 Agriculture         16.3  16.3 6.11   26.5
 6   -0.487 Bauxite             10.9  10.9 0.845  21.0
 7   -0.487 Forest              14.1  14.1 4.07   24.2
 8   -0.487 Urban               10.8  10.8 0.646  20.9
 9   -0.475 Agriculture         16.4  16.4 6.27   26.6
10   -0.475 Bauxite             11.1  11.1 1.01   21.2
# ℹ 790 more rows

Plotting confidence interval bands

  • geom_line() adds the mean prediction fit
  • geom_ribbon() adds the shaded bands bound by lwr and upr
ggplot() +
  geom_point(data = BSR,
             aes(x = log_Area, y = n_Species, color = Landscape_type)) +
  geom_line(data = Preds,
            aes(x = log_Area, y = fit, color = Landscape_type),
            size = 1) +
  geom_ribbon(data = Preds,
              aes(x = log_Area, ymin = lwr, ymax = upr, fill = Landscape_type))

Plotting confidence interval bands

Reordering plot elements

  • Hiding the legend for geom_ribbon
  • Relabeling axes
  • Changing the color palette
ggplot() +
  geom_ribbon(data = Preds,
              aes(x = log_Area, ymin = lwr, ymax = upr, fill = Landscape_type),
              alpha = 0.1,
              show.legend = FALSE) +
  geom_line(data = Preds,
            aes(x = log_Area, y = fit, color = Landscape_type),
            size = 1) +
  geom_point(data = BSR,
             aes(x = log_Area, y = n_Species, color = Landscape_type)) +
  labs(y = "Number of Species", x = "Area (log Ha)") +
  scale_color_paletteer_d(`"rcartocolor::Bold"`, name = "Landscape Type") +
  theme(legend.position = "inside",
        legend.position.inside = c(0.8, 0.15))

Reordering plot elements